Table of Contents¶

  1. Import Libraries
  2. Introduction
    • Key Steps in This Analysis
  3. Dataset Overview
    • Preliminary Data Checks
  4. Exploratory Data Analysis
    • Correlation Between DI, MENTHLTH, and ADDEPEV3
    • Boxplot of DI Distribution by State
    • Choropleth Map of DI Across US States
  5. Conclusion
    • Summary of Findings
    • DI Verification with MENTHLTH and ADDEPEV3
    • State-Level Differences in DI
    • Geographic Patterns in DI
    • Key Takeaways
    • Future Considerations

Import Libraries¶

This section imports the libraries used for data manipulation, visualization, and preprocessing.

In [1]:
# Suppress warnings to avoid clutter in the output
import warnings
warnings.simplefilter(action='ignore')

# Libraries for data manipulation
import pandas as pd
import numpy as np

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Library for preprocessing (scaling values)
from sklearn.preprocessing import MinMaxScaler

PATH = 'data/'

Introduction¶

This notebook presents an analysis of the Depression Index (DI) using the analysis_2022.csv dataset, which was created by Pete using data from 4-V-Merge_datasets.ipynb.

The goal of this analysis is to:

  1. Verify that DI maintains its intended relationship with its component variables (MENTHLTH and ADDEPEV3).
  2. Analyze how DI varies across U.S. states, using statistical summaries and visualizations.

Key Steps in This Analysis¶

  • Dataset: The dataset used in this notebook is the analysis_2022.csv file, derived from 4-V-Merge_datasets.ipynb.
  • DI Construction:
    • DI is derived from MENTHLTH (number of days mental health was not good) and ADDEPEV3 (depression diagnosis).
    • MENTHLTH ranges from 0 to 30, with:
      • 88 ("None") encoded as 0.
      • 77 ("Don’t know") and 99 ("Refused") excluded.
    • For the choropleth map only, DI is rescaled to a 0 to 1 range with MinMax scaling; the correlation check and the boxplot use the unscaled DI.
  • Analysis Approach:
    • Correlation Verification: Ensure DI correctly aligns with MENTHLTH and ADDEPEV3.
    • State-Level Analysis: Examine DI’s variation across U.S. states.
    • Visualizations: Use a boxplot and a choropleth map to highlight trends.
  • No Imputation: The analysis is performed on the dataset without imputation, using the data as originally processed.
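The MENTHLTH recoding described above can be sketched as follows. This is a minimal illustration, assuming the raw BRFSS codes arrive as a numeric column; the helper name `recode_menthlth` is hypothetical and is not taken from 4-V-Merge_datasets.ipynb, and the sketch does not attempt to reproduce how DI itself is computed.

```python
import numpy as np
import pandas as pd

def recode_menthlth(raw: pd.Series) -> pd.Series:
    """Recode raw MENTHLTH codes (hypothetical helper).

    88 ("None") becomes 0 days; 77 ("Don't know") and 99 ("Refused")
    become missing so they can be excluded afterwards.
    """
    return raw.astype("float64").replace({88: 0.0, 77: np.nan, 99: np.nan})

# Toy values for illustration only, not taken from the dataset
raw_codes = pd.Series([5, 88, 77, 30, 99, 1], name="MENTHLTH")

# Exclude the "Don't know" / "Refused" responses after recoding
clean = recode_menthlth(raw_codes).dropna()
print(clean.tolist())
```

Marking 77 and 99 as missing first and dropping them in a separate step keeps the exclusion explicit, which makes it easier to count how many responses were removed.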

Dataset Overview¶

In this section, we load the analysis_2022.csv dataset, which contains the Depression Index (DI) and related variables. We will inspect its structure and explore some basic characteristics.

Preliminary Data Checks¶

  • Check the dataset's structure, including the number of rows, columns, and data types.
  • Identify missing values.
  • Examine the first few records to understand the data format.
In [2]:
# Load the BRFSS Depression Index dataset
df = pd.read_csv(PATH + "brfss/analysis_2022.csv")

# Preview the first few rows to check structure and initial data format
df.head()
Out[2]:
  State         DH        DI  ADDEPEV3  MENTHLTH      DH_Z      DI_Z
0    AL  10.033333  0.000000       0.0       0.0  1.137661 -0.598581
1    AL  10.033333  0.000000       0.0       0.0  1.137661 -0.598581
2    AL  10.033333  0.444444       0.0       0.8  1.137661 -0.450494
3    AL  10.033333  0.000000       0.0       0.0  1.137661 -0.598581
4    AL  10.033333  0.000000       0.0       0.0  1.137661 -0.598581
In [3]:
# Display dataset structure
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402198 entries, 0 to 402197
Data columns (total 7 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   State     402198 non-null  object 
 1   DH        402198 non-null  float64
 2   DI        402198 non-null  float64
 3   ADDEPEV3  402198 non-null  float64
 4   MENTHLTH  402198 non-null  float64
 5   DH_Z      402198 non-null  float64
 6   DI_Z      402198 non-null  float64
dtypes: float64(6), object(1)
memory usage: 21.5+ MB
In [4]:
# Identify missing values in each column to assess data completeness
print("Missing Values Per Column:")
df.isnull().sum()
Missing Values Per Column:
Out[4]:
State       0
DH          0
DI          0
ADDEPEV3    0
MENTHLTH    0
DH_Z        0
DI_Z        0
dtype: int64

Exploratory Data Analysis¶

This section explores the Depression Index (DI) in relation to:

  • MENTHLTH (Days mental health was not good)
  • ADDEPEV3 (Depression diagnosis)
  • State-level variations in DI

We will:

  • Verify the correlation between DI and its component variables.
  • Visualize DI distribution across states using:
    • A boxplot (state-wise DI comparison)
    • A choropleth map (geographic DI distribution)

Correlation Between DI, MENTHLTH, and ADDEPEV3¶

Since DI is derived from MENTHLTH and ADDEPEV3, a correlation is expected.
This step is included for completeness to confirm the expected relationships.

In [5]:
# Compute correlation between DI, MENTHLTH, and ADDEPEV3 to examine relationships
correlation_matrix = df[['DI', 'MENTHLTH', 'ADDEPEV3']].corr()

# Display the computed correlation matrix
print(correlation_matrix)
                DI  MENTHLTH  ADDEPEV3
DI        1.000000  0.736028  0.927771
MENTHLTH  0.736028  1.000000  0.430261
ADDEPEV3  0.927771  0.430261  1.000000
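If this consistency check should fail loudly rather than rely on reading the printed matrix, it could be wrapped in a small helper. The sketch below is an assumption-laden illustration: the function name, the `min_r` threshold of 0.4, and the toy `DI_toy` column (a simple sum used only to have something to correlate) are all hypothetical and do not reflect how DI is actually constructed.

```python
import pandas as pd

def check_component_correlations(frame, index_col, components, min_r=0.4):
    """Return each component's Pearson r with the index; raise if any is below min_r."""
    corr = frame[[index_col, *components]].corr()[index_col].drop(index_col)
    weak = corr[corr < min_r]
    if not weak.empty:
        raise ValueError(f"Weak correlation with {index_col}: {weak.to_dict()}")
    return corr

# Toy data for illustration only
toy = pd.DataFrame({
    "MENTHLTH": [0, 0, 5, 10, 30, 2, 0, 15],
    "ADDEPEV3": [0, 0, 0, 1, 1, 0, 0, 1],
})
# Placeholder composite, NOT the project's DI formula
toy["DI_toy"] = toy["MENTHLTH"] / 30 + toy["ADDEPEV3"]

component_r = check_component_correlations(toy, "DI_toy", ["MENTHLTH", "ADDEPEV3"])
print(component_r.round(3))
```

On the real data the call would presumably be `check_component_correlations(df, 'DI', ['MENTHLTH', 'ADDEPEV3'])`, with the threshold chosen to suit the project.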

Boxplot of DI Distribution by State¶

This boxplot visualizes the Depression Index (DI) across U.S. states, sorted by Mean DI in descending order.

  • States are labeled with full names.
  • The x-axis is adjusted to range from 0 to 11 for clarity.
  • Outliers are included to show variations.
  • Higher Mean DI states appear at the top, providing insight into the states with higher average depression index values.
In [6]:
# Map state abbreviations to full state names for better readability
state_mapping = {
    "AL": "Alabama", "AK": "Alaska", "AZ": "Arizona", "AR": "Arkansas", "CA": "California",
    "CO": "Colorado", "CT": "Connecticut", "DE": "Delaware", "FL": "Florida", "GA": "Georgia",
    "HI": "Hawaii", "ID": "Idaho", "IL": "Illinois", "IN": "Indiana", "IA": "Iowa",
    "KS": "Kansas", "KY": "Kentucky", "LA": "Louisiana", "ME": "Maine", "MD": "Maryland",
    "MA": "Massachusetts", "MI": "Michigan", "MN": "Minnesota", "MS": "Mississippi", "MO": "Missouri",
    "MT": "Montana", "NE": "Nebraska", "NV": "Nevada", "NH": "New Hampshire", "NJ": "New Jersey",
    "NM": "New Mexico", "NY": "New York", "NC": "North Carolina", "ND": "North Dakota", "OH": "Ohio",
    "OK": "Oklahoma", "OR": "Oregon", "PA": "Pennsylvania", "RI": "Rhode Island", "SC": "South Carolina",
    "SD": "South Dakota", "TN": "Tennessee", "TX": "Texas", "UT": "Utah", "VT": "Vermont",
    "VA": "Virginia", "WA": "Washington", "WV": "West Virginia", "WI": "Wisconsin", "WY": "Wyoming"
}

# Ensure 'State_Full' is created before grouping
df['State_Full'] = df['State'].map(state_mapping)

# Remove rows where 'State_Full' is missing (i.e., states not found in the mapping)
df = df.dropna(subset=['State_Full'])

# Compute mean DI per state to determine ranking
mean_DI_by_state = df.groupby('State_Full')['DI'].mean().sort_values(ascending=False)

# Reorder 'State_Full' based on mean DI for proper ranking in visualization
df['State_Full'] = pd.Categorical(df['State_Full'], categories=mean_DI_by_state.index, ordered=True)

# Create a horizontal boxplot showing DI distribution across states (sorted by Mean DI)
plt.figure(figsize=(8, 12))

# Plot boxplot, showing IQR and outliers
sns.boxplot(x='DI', y='State_Full', data=df, showfliers=True, color="darkblue")

# Set x-axis range for better visualization
plt.xlim(0, 11)

# Add title and axis labels
plt.title("Distribution of DI Across U.S. States (Sorted by Mean DI)", fontsize=16)
plt.xlabel("Depression Index (DI)", fontsize=14)
plt.ylabel("State", fontsize=14)

# Adjust layout for better spacing
plt.tight_layout(pad=2)

# Display the boxplot
plt.show()
[Figure: horizontal boxplot, "Distribution of DI Across U.S. States (Sorted by Mean DI)"; x-axis: Depression Index (DI), y-axis: State]

Interpretation of the Boxplot: DI Distribution Across U.S. States¶

This boxplot visualizes the Depression Index (DI) across U.S. states, ranked by Mean DI (highest to lowest). Each blue box spans the middle 50% of DI values (the interquartile range, IQR), while the black dots mark outliers. Because the states are sorted by Mean DI rather than by box width, some states with narrower boxes appear higher in the ranking.

For example, Oregon ranks higher than Kentucky despite having a narrower box because its Mean DI is slightly higher. Kentucky has the wider IQR, meaning its middle 50% of DI values are more spread out. Similarly, states like Vermont and North Carolina appear at the same height, but their IQRs differ, meaning their middle 50% of DI values are distributed differently.
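The distinction between ranking by mean and comparing IQRs can be made concrete with a per-group summary table. The sketch below uses toy values and hypothetical state labels ("State A", "State B"); the helper name `state_summary` is also an assumption, not part of the notebook.

```python
import pandas as pd

def state_summary(frame, group_col="State_Full", value_col="DI"):
    """Per-group mean, median, quartiles and IQR, sorted by mean (descending)."""
    grouped = frame.groupby(group_col)[value_col]
    summary = pd.DataFrame({
        "mean": grouped.mean(),
        "median": grouped.median(),
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
    })
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary.sort_values("mean", ascending=False)

# Toy data: State A has one large value, State B is spread more evenly
toy = pd.DataFrame({
    "State_Full": ["State A"] * 5 + ["State B"] * 5,
    "DI": [0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 1.0, 2.0, 3.0, 3.0],
})

summary = state_summary(toy)
print(summary)
```

Here the group with the higher mean is listed first even though its IQR is smaller, which is the same pattern described for the boxplot: a few large values can raise the mean without widening the box.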

One notable observation is a gap in the outliers just above ~4.5 on the x-axis. Most states have no DI values between ~4.5 and ~5.0, so their data clusters either below 4.5 or above 5.0. A few states, such as New Hampshire, Florida, Maryland, and New York, do have data points near 4.5, which form distinct clusters of outliers. The visual break therefore reflects how sparse DI values are in this range; it is not an artifact of the visualization.
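A gap like this can be checked numerically by counting, per state, how many DI values fall inside the interval in question. The following is a minimal sketch on toy data; the helper name `count_in_range` and the state labels are hypothetical, and the bounds 4.5 and 5.0 are simply the approximate values read off the plot.

```python
import pandas as pd

def count_in_range(frame, low, high, group_col="State", value_col="DI"):
    """Count values with low <= value < high within each group."""
    in_range = frame[value_col].between(low, high, inclusive="left")
    return in_range.groupby(frame[group_col]).sum().astype(int)

# Toy data for illustration only
toy = pd.DataFrame({
    "State": ["X", "X", "X", "Y", "Y", "Y"],
    "DI": [0.5, 4.4, 5.2, 4.6, 4.7, 1.0],
})

counts = count_in_range(toy, 4.5, 5.0)
print(counts)
```

States with a count of zero contribute nothing to that stretch of the x-axis, so a table of mostly zeros would be consistent with the visual break.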

This chart helps identify states with higher average DI scores while also showing how much variation exists within each state.

Choropleth Map of DI Across US States¶

This choropleth map highlights state-level DI variations, using a yellow-to-blue color scale for visual clarity.

  • Yellow represents lower DI values.
  • Blue represents higher DI values.
  • MinMax scaling rescales DI values to the 0 to 1 range, so the color bar reads on a fixed scale.
In [7]:
# Normalize DI values for visualization
scaler = MinMaxScaler(feature_range=(0, 1))
df['DI_scaled'] = scaler.fit_transform(df[['DI']])

# Compute state-level averages
state_avg_di = df.groupby('State', as_index=False)['DI_scaled'].mean()

# Generate Choropleth Map
fig = px.choropleth(
    state_avg_di,
    locations='State',
    locationmode="USA-states",
    color='DI_scaled',
    color_continuous_scale=["yellow", "blue"],  # Color scale: yellow (low) to blue (high)
    title="Choropleth Map of Depression Index (DI) Across U.S. States",
    scope="usa",
    labels={'DI_scaled': 'Depression Index'}
)

# Adjust color scale dynamically to ensure consistency
fig.update_layout(
    coloraxis=dict(
        cmin=state_avg_di['DI_scaled'].min(),
        cmax=state_avg_di['DI_scaled'].max()
    ),
    coloraxis_colorbar=dict(
        title="Depression Index",
        tickvals=[state_avg_di['DI_scaled'].min(), state_avg_di['DI_scaled'].mean(), state_avg_di['DI_scaled'].max()],
        ticktext=[f"Low ({state_avg_di['DI_scaled'].min():.2f})",
                  f"Medium ({state_avg_di['DI_scaled'].mean():.2f})",
                  f"High ({state_avg_di['DI_scaled'].max():.2f})"],
        len=0.75,  # Taller color bar
        thickness=12,  # Make the color bar thinner
        y=0.5,  # Center the color bar
    ),
    width=900,  # Set a fixed width to prevent excessive stretching
    height=500,  # Adjust height to keep the map proportional
    margin=dict(l=50, r=50, t=50, b=50)  # Adjust margins for centering
)

# Show the map
fig.show()
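With its default settings and `feature_range=(0, 1)`, `MinMaxScaler` applies the linear map (x - min) / (max - min). Because the map is linear, averaging the scaled values per state gives the same result as scaling the per-state averages with the same global minimum and maximum, so the ordering of states on the map is unchanged. The sketch below illustrates this on toy data with plain pandas; the helper name `min_max_scale` and the state labels are hypothetical.

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Linearly rescale values to the 0-1 range using their own min and max."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

# Toy data for illustration only
toy = pd.DataFrame({
    "State": ["X", "X", "Y", "Y"],
    "DI": [0.0, 2.0, 4.0, 10.0],
})
toy["DI_scaled"] = min_max_scale(toy["DI"])

# Route 1: scale every row, then average per state
mean_of_scaled = toy.groupby("State")["DI_scaled"].mean()

# Route 2: average per state, then apply the same global min/max
raw_means = toy.groupby("State")["DI"].mean()
scaled_mean = (raw_means - toy["DI"].min()) / (toy["DI"].max() - toy["DI"].min())

print(pd.DataFrame({"mean_of_scaled": mean_of_scaled, "scaled_mean": scaled_mean}))
```

Both columns agree, which suggests the scaling step mainly changes the numbers shown on the color bar rather than the relative shading of the states.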

Conclusion¶

This analysis examined the Depression Index (DI) at the state level, verifying its relationship with its component variables and examining how it varies geographically.


Summary of Findings¶

DI Verification with MENTHLTH and ADDEPEV3¶

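  • As expected, DI is strongly correlated with both of its source variables (r ≈ 0.93 with ADDEPEV3 and r ≈ 0.74 with MENTHLTH), confirming that DI maintains the intended relationship with them. Because DI is built from these variables, this is a consistency check on the construction rather than an external validation of DI as a measure of self-reported mental health difficulties.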

State-Level Differences in DI¶

  • The boxplot analysis, sorted by Mean DI in descending order, shows substantial variation across states.
  • States with higher Mean DI are ranked at the top, representing regions with higher depression burden.
  • A notable gap in outliers around DI 4.5 suggests that most states lack data points in this range, causing a visible break in the distribution.

Geographic Patterns in DI¶

  • The choropleth map highlights distinct state-level differences in DI distribution.
  • Certain states show consistently higher DI values, while others remain lower.
  • However, the observed patterns do not follow strict regional clustering, suggesting that socioeconomic conditions, healthcare access, or demographic variations could be influencing DI.

Key Takeaways¶

  • DI behaves as intended: it shows the expected relationships with MENTHLTH and ADDEPEV3.
  • State-level DI variation is substantial, revealing key differences in mental health burdens across the U.S.
  • A gap in DI values around 4.5 was observed, indicating that most states have little to no data in this range.
  • Regional trends are complex, meaning that external factors beyond location alone likely influence mental health outcomes.
  • Visualizing DI helps provide insights into mental health disparities, which could support targeted policy interventions.

Future Considerations¶

This analysis provides a strong foundation for understanding state-level mental health variations using DI. Future research could extend these findings by:

  • Incorporating Socioeconomic and Healthcare Data: Exploring how income levels, access to mental healthcare, and other social determinants impact DI.
  • Analyzing Longitudinal Trends: Examining year-over-year changes in DI to identify shifting mental health patterns over time.
  • Refining DI Calculation Methods: Experimenting with alternative scaling or weighting approaches to enhance interpretability and accuracy.
  • Validating DI with External Datasets: Comparing DI against national health surveys or clinical data to assess its robustness as a mental health metric.

Record Dependencies¶

In [8]:
%load_ext watermark
%watermark
%watermark --iversions
Last updated: 2025-02-17T02:38:09.334387+00:00

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 8.17.2

Compiler    : GCC 11.3.0
OS          : Linux
Release     : 6.5.0-1020-aws
Machine     : x86_64
Processor   : x86_64
CPU cores   : 64
Architecture: 64bit

plotly    : 5.24.1
seaborn   : 0.12.2
matplotlib: 3.7.1
pandas    : 2.0.2
numpy     : 1.24.3
sklearn   : 1.2.2

This concludes the Depression Index analysis. The findings from this notebook, along with the EDA and data preparation steps in the earlier notebooks, form the foundation of the BRFSS project section on the portfolio site.
